Rntbd health check improvement 2 #33464
API change check APIView has identified API level changes in this PR and created following API reviews. |
...smos/azure-cosmos-spark_3_2-12/src/main/scala/com/azure/cosmos/spark/CosmosClientCache.scala
...om/azure/cosmos/implementation/directconnectivity/rntbd/RntbdClientChannelHealthChecker.java
/azp run java - cosmos - spark |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
Thanks Annie - Kudos! Very clear design and implementation.
/azp run java - cosmos - tests |
Azure Pipelines successfully started running 1 pipeline(s). |
Failed tests: |
/check-enforcer override |
Continued effort to improve the Rntbd health check flow, especially timeout detection.
Why the changes are needed
Based on a few recent latency investigations, several patterns have been identified that call for more aggressive connection closure.
Changes included in this PR:
- timeoutDetectionEnabled: Default true
- timeoutDetectionDisableCPUThreshold: Default 90.0
- timeoutDetectionTimeLimit: Default 60s
- timeoutDetectionHighFrequencyThreshold: Default 3
- timeoutDetectionHighFrequencyTimeLimit: Default 10s
- timeoutDetectionOnWriteThreshold: Default 1
- timeoutDetectionOnWriteTimeLimit: Default 6s
A few timeout scenarios trigger a connection to be closed:
- timeoutDetectionTimeLimit: A timeout has been observed; it does not matter how many timeouts have been observed. This helps detect a broken connection for sparse workloads.
- timeoutDetectionHighFrequencyThreshold + timeoutDetectionHighFrequencyTimeLimit: Timeouts have happened very frequently. In this case, we want to close the channel more aggressively.
- timeoutDetectionOnWriteThreshold + timeoutDetectionOnWriteTimeLimit: A timeout happened on a write-related operation. Because write operations use only the primary replica, we want to close the channel more aggressively as well.
- timeoutDetectionDisableCPUThreshold: High CPU can cause a high number of request timeouts. When this happens, closing existing channels and establishing new ones will not help the situation but rather make it worse. When the CPU threshold is hit, timeout detection is disabled, and it resumes automatically once CPU usage drops back below the configured threshold.
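The rules above can be read as a single decision function. The following is a minimal sketch of that reading, not the code in this PR: the class `TimeoutDetectionSketch`, the method `shouldClose`, and its parameters are hypothetical names. It also assumes that each time limit is measured from the last successful read on the channel, that a threshold is met when the timeout count is greater than or equal to it, and that detection is disabled when CPU usage is at or above the CPU threshold. The description does not spell out any of these details.

```java
import java.time.Duration;

/** Hypothetical sketch of the closure rules; not the SDK's RntbdClientChannelHealthChecker. */
final class TimeoutDetectionSketch {

    // Default values as listed in the PR description.
    static final boolean ENABLED = true;
    static final double DISABLE_CPU_THRESHOLD = 90.0;
    static final Duration TIME_LIMIT = Duration.ofSeconds(60);
    static final int HIGH_FREQUENCY_THRESHOLD = 3;
    static final Duration HIGH_FREQUENCY_TIME_LIMIT = Duration.ofSeconds(10);
    static final int ON_WRITE_THRESHOLD = 1;
    static final Duration ON_WRITE_TIME_LIMIT = Duration.ofSeconds(6);

    private TimeoutDetectionSketch() {
    }

    /**
     * Decides whether a channel should be closed because of timeouts.
     *
     * @param timeoutCount      timeouts seen since the last successful read (assumed meaning)
     * @param writeTimeoutCount how many of those timeouts were on write operations
     * @param sinceLastRead     time elapsed since the last successful read (assumed reference point)
     * @param cpuPercent        current CPU usage in percent
     */
    static boolean shouldClose(int timeoutCount, int writeTimeoutCount,
                               Duration sinceLastRead, double cpuPercent) {
        // Nothing to do if detection is off or no timeout has been observed.
        if (!ENABLED || timeoutCount <= 0) {
            return false;
        }
        // Under high CPU, reconnecting would make things worse, so detection is skipped.
        if (cpuPercent >= DISABLE_CPU_THRESHOLD) {
            return false;
        }
        // Rule 1: any timeout at all, once the general time limit has passed.
        if (sinceLastRead.compareTo(TIME_LIMIT) >= 0) {
            return true;
        }
        // Rule 2: frequent timeouts, with a shorter time limit.
        if (timeoutCount >= HIGH_FREQUENCY_THRESHOLD
            && sinceLastRead.compareTo(HIGH_FREQUENCY_TIME_LIMIT) >= 0) {
            return true;
        }
        // Rule 3: timeouts on writes, with the shortest time limit.
        return writeTimeoutCount >= ON_WRITE_THRESHOLD
            && sinceLastRead.compareTo(ON_WRITE_TIME_LIMIT) >= 0;
    }
}
```

Under these assumptions, `shouldClose(1, 1, Duration.ofSeconds(6), 50.0)` closes the channel because of the write rule, while the same call with a CPU value of 95.0 leaves it open.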